SLIDE 1

Consistency of NoSQL Models

Au Tran, Thy Nguyen, Chaz Chang, Vijaypal Singh, Timothy To, Akash Budholia

SLIDE 2

Introduction: From RDBMS to NoSQL

  • In the past, ACID (Atomicity, Consistency, Isolation, and Durability) was a must-have requirement for all traditional monolithic database systems. Strong consistency was a must-have for any database system, which only offers vertical scalability and prevents horizontal scalability.
  • As demand grows, the need to scale for high availability becomes necessary. For this reason, strong consistency can no longer be enforced and databases must relax their consistency levels. Therefore, NoSQL database systems have emerged.

Strong consistency vs. High Availability

SLIDE 3

Introduction: Sub-categories of NoSQL Databases

  • Redis: Key-value store
  • Cassandra: Column store
  • MongoDB: Document store
  • Neo4j: Graph database
  • OrientDB: Multi-model

In this study, we compare the consistency models of five of the most popular non-cloud database systems: Redis, Cassandra, MongoDB, OrientDB, and Neo4j.
SLIDE 4

Introduction: Data-centric and Client-centric

SLIDE 5

Consistency Models

We will review 8 main consistency models:

  • Strong consistency
  • Weak consistency
  • Eventual consistency
  • Causal consistency
  • Read-your-writes consistency
  • Session consistency
  • Monotonic Reads consistency
  • Monotonic Writes consistency
SLIDE 6

Strong consistency vs Weak consistency

Strong consistency (a.k.a. linearizability)

  • Operations must be committed immediately => events are seen in order and all clients observe the same data state
  • Read operation: after all write commits are done => reads return the new version of the data

Weak consistency

  • Does not guarantee a specific order of events
  • Read operation: not guaranteed to return the most updated value
  • Inconsistency window: the time period between the write operation and when every read operation returns the updated value

SLIDE 7

Eventual consistency

  • Eventual consistency strengthens Weak Consistency.
  • In this model, read operations may retrieve an older version instead of the latest one, as in Weak Consistency, while the replicas converge to the same data state.
  • However, after the inconsistency window, the latest data will be retrieved.

[Diagram: Strong consistency, Weak Consistency, and Eventual Consistency compared]

SLIDE 8

Causal Consistency

  • If some process updates a given object:
    ○ Processes that acknowledge the update get the updated value
    ○ Processes that do not acknowledge the update follow the Eventual Consistency model

[Diagram: Weak Consistency, Eventual Consistency, Causal Consistency, and Sequential Consistency compared]

SLIDE 9

Read-your-writes Consistency

  • Read-your-writes consistency ensures that a replica is at least current enough to contain the changes made by a specific transaction.

  • If some process updates a

given object, this same process will always consider the updated value.

  • Other processes will

eventually read the updated value after the inconsistency window

SLIDE 10

Session Consistency

  • In the context of the existence of a session, read-your-writes

consistency model will be applied.

  • All reads are current with the writes from that session, but writes from other sessions may lag.

  • Data from other sessions come in the correct order, just isn’t guaranteed

to be current.

  • Good performance and good availability at half the cost of strong

consistency

SLIDE 11

Monotonic Reads Consistency

  • After a process reads some value, all successive reads will return that same value or a more recent one.
  • Monotonic reads ensure that if a process performs read x1, then x2, then x2 cannot observe a state prior to the writes which were reflected in x1; intuitively, reads cannot go backward.
  • Monotonic reads do not apply to operations performed by different processes, only to reads by the same process.

SLIDE 12

Monotonic Writes Consistency

  • A write operation invoked by a process on a given object needs to be completed before any subsequent write operation on the same object by the same process.
  • Monotonic writes ensure that if a process performs write w1, then w2, then all processes observe w1 before w2.
  • Monotonic writes do not apply to operations performed by different processes, only to writes by the same process.

SLIDE 13

Redis

Description from the official website (https://redis.io/): "Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, geospatial indexes with radius queries"

SLIDE 14

Redis - Background

  • Key-value store
  • Optimizes data in memory by:
    ○ prioritizing high performance
    ○ low computation complexity
    ○ high memory space efficiency
    ○ low application network traffic

  • Guarantees high availability by extending its architecture and introducing

the Redis Cluster

  • Strongly consistent in a single-instance configuration
  • Eventually consistent in a cluster when the client reads from replica nodes
SLIDE 15

Redis - Cluster specification

  • High performance and linear scalability up to 1000 nodes.
  • Relaxed write guarantees: Redis Cluster tries its best to retain all write operations issued by the application, but some of these operations can be lost.

  • Availability: Redis Cluster survives network partitions as long as the

majority of the master nodes are reachable and there is at least one reachable slave for every master node that is no longer reachable.

SLIDE 16

Redis - Keys and master-slave model

  • Redis Cluster distributes keys into 16384 hash slots.
  • Each master stores a subset of the 16384 slots.
  • To compute the hash slot of a given key, the formula below is used

(CRC16 used as a hash algorithm): HASH_SLOT = CRC16(key) mod 16384

  • Architecture implements a master-slave model without proxies which

means that the application is redirected to the node that has the requested data. Redis nodes do not intermediate responses.

  • Each master node holds a hash slot. This slot has 1 to N replicas (including

the master and its replica nodes).
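To make the slot computation concrete, here is a minimal Python sketch of the formula above. It is not a real client library; the function names are illustrative, and it assumes the CRC16 variant Redis Cluster uses is the CCITT/XMODEM one (polynomial 0x1021, initial value 0).

```python
# Sketch of HASH_SLOT = CRC16(key) mod 16384 for Redis Cluster key placement.

def crc16_xmodem(data: bytes) -> int:
    # CCITT/XMODEM CRC16: poly 0x1021, initial value 0, no reflection.
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def hash_slot(key: str) -> int:
    return crc16_xmodem(key.encode()) % 16384

print(hash_slot("user:1000"))  # an integer in 0..16383 identifying the owning master
```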

SLIDE 17

Redis - Hash tags

  • Hash tags ensure that two keys are allocated in the same slot
  • Allows for multi-key operations
  • Part of the key has to be a common substring between the two keys and

inside brackets

  • These two keys end up in the same slot because only the substring inside the brackets will be hashed:

{user:1000}following
{user:1000}followers
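A hedged sketch of how a client could pick the effective hash key when a hash tag is present. The helper name is made up for the example, it mirrors the cluster specification's rule of hashing only the substring inside the first non-empty pair of braces, and it reuses the hash_slot sketch from the previous slide conceptually.

```python
def effective_hash_key(key: str) -> str:
    # If the key contains a {...} hash tag, only the substring inside the first
    # non-empty pair of braces is hashed, so related keys share a slot.
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:  # non-empty tag
            return key[start + 1:end]
    return key

# Both keys hash to the slot of "user:1000", enabling multi-key operations on them.
assert effective_hash_key("{user:1000}following") == effective_hash_key("{user:1000}followers") == "user:1000"
```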

SLIDE 18

Redis - Redis Cluster

  • Redis Cluster is formed by N nodes

connected by TCP connections.

  • Each node has N-1 outgoing

connections and N-1 incoming connections.

  • A connection is kept alive as long as the two connected nodes live.

SLIDE 19

Redis - Asynchronous replication and writes

  • When a master node receives

an application issued request, it handles it and asynchronously propagates any changes to its replicas.

  • By default, the master node acknowledges the application without assured replication.


SLIDE 20

Redis - Asynchronous replication and writes

  • On the asynchronous replication configuration (default), if the master

node dies before replicating and after acknowledging the client, the data is permanently lost. Therefore, the Redis Cluster is not able to guarantee write persistence at all times.

  • This behavior can be overridden by explicitly issuing the WAIT command (see the sketch below), but this profoundly compromises performance and scalability, the two main strong points of using Redis.
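As a sketch of that trade-off with the redis-py client (the WAIT command itself is real; the host, port, key, and replica count are illustrative), a client can block after a write until at least one replica has acknowledged it, paying extra latency for durability.

```python
import redis

r = redis.Redis(host="localhost", port=6379)  # assumed master of the target slot
r.set("user:1000:name", "Alice")
# Block for up to 100 ms until at least 1 replica has acknowledged the write;
# the reply is the number of replicas that actually acknowledged it.
acked = r.execute_command("WAIT", 1, 100)
print(f"write acknowledged by {acked} replica(s)")
```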

SLIDE 21

Redis - Node failure

  • Suppose we have a master node A

and a single replica A1.

  • If A fails, A1 is promoted to master, and the cluster will continue to operate.
  • However, if A has no replicas, or A and A1 fail at the same time, the Redis Cluster will not be able to continue operating.


SLIDE 22

Redis - Network partition

[Diagram: a network partition separates master A (with the client) from its replicas A1 and A2]

SLIDE 23

Redis - Network partition

  • In the case of a network partition event, suppose the client is on the minority side with master A while its replicas A1 and A2 reside on the majority side. If the partition holds for too long (NODE_TIMEOUT), the majority side starts an election process to elect a new master among them, either A1 or A2.
  • Node A is also aware of the timeout and changes its role from master to slave. Consequently, it will refuse any further write operations from the client.
  • In this case, Redis Cluster is not the best solution for applications that require high availability under large network partition events.

SLIDE 24

Redis - Replica migration

[Diagram: replica migration, one of B's replicas is reassigned to replicate from A after A1 is separated from the cluster]

SLIDE 25

Redis - Replica migration

  • Suppose that the majority side has N nodes, including masters A and B and their replicas A1, B1, and B2, respectively, and a network partition event occurs in such a way that the replica A1 is separated from the rest.
  • If the partition lasts long enough for A1 to be assumed unreachable, Redis Cluster uses a strategy called replica migration to reorganize the cluster: because B has multiple slaves, one of B's replicas will now replicate from A instead of B.

SLIDE 26

Redis - Replica node read

  • There is also the possibility of reading from replica nodes instead of master nodes, in order to achieve a more read-scaled system.
  • By using the READONLY command, the client accepts the possibility of reading stale data, which is reasonable for situations where having the latest data is not critical (see the sketch below).
  • This therefore leads to an eventual consistency model.
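A hedged sketch with redis-py's cluster client, assuming a recent redis-py (4+) where the read_from_replicas option makes the client issue READONLY on replica connections; the seed host and port are placeholders.

```python
from redis.cluster import RedisCluster

# Opt into replica reads: the client sends READONLY on replica connections,
# so GETs may be served by a replica and can return slightly stale data.
rc = RedisCluster(host="localhost", port=7000, read_from_replicas=True)
rc.set("page:views", 42)      # writes still go to the slot's master
print(rc.get("page:views"))   # may briefly lag behind the master's value
```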


SLIDE 27

Cassandra

  • Column-based NoSQL store. Initially developed by Facebook to improve

their Inbox Search performance

  • Built with distributed systems in mind. Cassandra is on the AP (Availability

& Partition Tolerance) side of the CAP Theorem.

  • Can be configured to be a CP (Consistency & Partition Tolerance)

database

  • When configured as CP, it behaves as a strongly consistent database even when subjected to network partitions

SLIDE 28

Cassandra

  • In its default configuration, Cassandra is an AP database (clients may read inconsistent data), but it can be modified to behave like a CP database

SLIDE 29

Cassandra

  • Describes data with columns
  • A keyspace, corresponding to the database, is composed of

column-families

  • A column-family represents a class of objects such as Car or Person. Each column-family has a variable number of entries called rows.
  • A row is identified by the partition key (row key) and holds an arbitrarily large number of columns.
  • Each column contains a key-value pair and a timestamp used to resolve consistency conflicts
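To illustrate this data model (keyspace, column family/table, partition key), here is a hedged sketch using the DataStax Python driver; the keyspace, table, and replication settings are made up for the example.

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # assumed local node
session = cluster.connect()

# A keyspace corresponds to the database; the replication factor controls how
# many replicas hold each row.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS shop
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

# A table (column family) describes a class of objects; user_id is the partition
# key that decides which nodes own the row, and each column value carries a timestamp.
session.execute("""
    CREATE TABLE IF NOT EXISTS shop.users (
        user_id uuid PRIMARY KEY,
        name text,
        email text
    )
""")
```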

SLIDE 30

Cassandra

  • Scales up by distributing data across a cluster, or set, of nodes
  • Every node can handle client requests. If the requested data is not on the node, the node becomes the "coordinator" responsible for retrieving the data from neighboring nodes and answering back to the application
  • Data is partitioned by hashing the row key, such that the largest hash value wraps around to the smallest hash value to form a ring

SLIDE 31

Cassandra Write Consistency Levels

Cassandra is designed to be eventually consistent, highly available, and low-latency. However, write consistency levels can be modified using configuration constants to satisfy user requirements:

  • ALL - Write succeeds on all replica nodes in the cluster before they respond to the client.

(Strong Consistency, high latency)

  • ONE - Write succeeds on one replica node before responding to client (Eventual consistency, low

latency)

  • LOCAL_ONE - Write succeeds on one replica node in the same data center as the coordinator node before responding to the client (Eventual consistency, low latency)

  • ANY - A single replica may respond, or the coordinator may store a hint. If a hint is stored, the

coordinator will later attempt to replay the hint and deliver the mutation to the replicas. This consistency level is only accepted for write operations.

SLIDE 32

Cassandra Write Consistency Levels

  • QUORUM - Write succeeds on a given number of replica nodes before responding back to the client; that number is called the quorum. (Eventual consistency, low latency)
  • LOCAL_QUORUM - Write succeeds on a given quorum of replica nodes in the same data center as the coordinator node. (Eventual consistency, low latency)

SLIDE 33

Cassandra Read Consistency Levels

Similar to Write Consistency Levels, the following configuration constants describe Cassandra read consistency levels:

  • ALL - Requires all replica nodes to confirm the data before responding to the client. (Strong consistency, less availability)
  • ONE - Retrieves the data from the first node to respond and returns that data to the client. (Eventual consistency, high availability)
  • LOCAL_ONE - Retrieves data from the first node to respond in the same data center and returns that data to the client. (Eventual consistency, high availability)
  • QUORUM - Requires a given number of nodes to respond to the read request before responding to the client; that number is the quorum. (Eventual consistency, high availability)
  • LOCAL_QUORUM - Requires a given number of nodes in the same data center to respond to the read request before responding to the client. (Eventual consistency, high availability)
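A hedged sketch of tuning these levels per statement with the DataStax Python driver, reusing the keyspace and table names assumed in the earlier sketch: a QUORUM write paired with a ONE read, one of the combinations discussed in these slides.

```python
import uuid
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("shop")   # assumed local node and keyspace

# Write at QUORUM: a majority of replicas must acknowledge before the client is answered.
write = SimpleStatement(
    "INSERT INTO users (user_id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write, (uuid.uuid4(), "Alice"))

# Read at ONE: the first replica to answer wins, so the value may be stale.
read = SimpleStatement("SELECT name FROM users", consistency_level=ConsistencyLevel.ONE)
for row in session.execute(read):
    print(row.name)
```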

SLIDE 34

Cassandra Read Repair

  • At read consistency levels other than ONE or LOCAL_ONE, Cassandra uses a Read Repair routine to improve consistency.
  • A read request performs a check on all of the queried replica nodes. For any replica nodes with an out-of-date value, Cassandra issues read-repair requests to those nodes and updates them to the latest value. Only after all read-repair requests are done does the coordinator node respond back to the client.

SLIDE 35

Cassandra Read Repair Chance

If the read consistency level is set to ONE or LOCAL_ONE, the coordinator node only waits for one node to respond to its request. Since only one version of the data is checked, Read Repair requests do not normally happen.

  • A read repair chance mechanism is in place for consistency level ONE or

LOCAL_ONE.

  • Given a Read Repair Chance of 10% and a Replication Factor of 3, approximately 10% of the reads will trigger a Read Repair request and make sure the latest data is propagated to all 3 replicas.

SLIDE 36

Cassandra - Probabilistically Bounded Staleness

  • Developed by UC Berkeley
  • A simulator to determine how long it takes to reach 100% consistency under different configurations of write and read consistency levels

  • The graph represents the probability of a client request receiving the

latest version of data over time (ms) for a given combination of available cluster hosts, read quorum, and write quorum.

  • We denote the number of available cluster hosts as N, the read quorum as R, and the write quorum as W

  • All configurations for this simulation assume a Replication Factor above 1
SLIDE 37

Cassandra - Probabilistically Bounded Staleness

ALL Write Consistency Level or ALL Read Consistency Level

  • All nodes must respond before write/read requests respond to client
  • Strong consistency
SLIDE 38

Cassandra - Probabilistically Bounded Staleness

ONE Read Consistency Level and QUORUM Write Consistency Level

  • Needs 3 nodes to respond to a write and 1 node to respond to a read
  • Eventually consistent, with a low chance of inconsistent data being returned
SLIDE 39

Cassandra - Probabilistically Bounded Staleness

QUORUM Read Consistency Level and ONE Write Consistency Level

  • A read request requires 3 nodes to respond while a write request requires one
  • The time to reach 100% consistency is only 4 ms, much shorter than in other configurations
  • Eventually consistent
SLIDE 40

Cassandra - Probabilistically Bounded Staleness

ONE Read Consistency Level and ONE Write Consistency Level

  • One node responds to both read and write requests
  • Significantly higher chance of returning out-of-date data compared to other configurations
  • The time it takes to reach 100% consistency can be shortened by increasing the read repair chance and lowering the Replication Factor
  • The 'strongest' form of eventual consistency in Cassandra
SLIDE 41

MongoDB

  • Inspired by the limitations of RDBMS
  • Expressive Query Language, secondary indexes, strong consistency

are taken from RDBMS

  • Schema-less and easier horizontal scalability are the NoSQL

concepts

  • Document based data model
  • Documents are stored in BSON format
  • Related documents are organized as collections
SLIDE 42

MongoDB Sharding

  • Sharding allows horizontal scalability
  • Allows data to be distributed among many

nodes

  • Query router is responsible for redirecting

queries to the correct shard depending on the sharding strategy and shard value

  • Sharding uses config servers and mongos

to carry out its operations

SLIDE 43

MongoDB Sharding Strategy

  • Range-based Sharding:
    ○ Documents are distributed based on the shard-key values.
    ○ Shard-key values close to each other will most likely end up on the same shard.

SLIDE 44

MongoDB Sharding Strategy

  • Hash-based Sharding:
    ○ An MD5 hash is used to hash the shard keys.
    ○ The idea behind hash-based sharding is to distribute the data evenly among the shards.
    ○ Could be slower for range-based queries.
    ○ Good for monotonically increasing ids (see the sketch below).
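A hedged sketch of enabling hash-based sharding through the admin commands with PyMongo; it assumes the client is connected to a mongos router, and the database, collection, and key names are illustrative.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed to point at a mongos router

# Enable sharding for the database, then shard the collection on a hashed key so
# documents with monotonically increasing ids still spread evenly across shards.
client.admin.command("enableSharding", "shop")
client.admin.command("shardCollection", "shop.orders", key={"order_id": "hashed"})
```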

SLIDE 45

MongoDB Sharding Strategy

  • Location-aware Sharding:
    ○ Users can specify a custom configuration to accomplish application requirements.
    ○ For example, high-demand data can be stored in memory and less-demanded data on disk.

SLIDE 46

MongoDB ACID

  • Follows ACID properties similar to RDBMS:
  • Atomicity: supports single operation inserts and updates
  • Consistency: can be used on a strong consistency approach
  • Isolation: While a document is updated, it is entirely isolated. Any error

would result in a rollback operation, and no user will be reading stale data

  • Durability: MongoDB implements a feature called write concern. Write concerns are user-defined policies that need to be fulfilled in order to commit.

SLIDE 47

MongoDB Replication

  • Allows configuring a replica set
  • Replica set has one primary and multiple secondary replica set members
  • A heartbeat or ping is used to check the health of connections in a cluster
  • A primary member of a cluster is elected through an election
  • An election occurs:
    ○ When a new replica set is initiated
    ○ When the primary steps down
    ○ On node failover, where the primary can't reach a majority of the secondaries

SLIDE 48

MongoDB Strong Consistency

  • Writes and reads are done from the primary replica set member.
  • The primary member writes all the operations of a transaction to the oplog.
  • After the primary member acknowledges the application of the committed data and logs the operations, secondary replica set members can read from this log and replay all operations so that they reach the same state as the primary member.
  • Since applications can only read from the primary, all reads are consistent because they are served by the same node.

SLIDE 49

MongoDB Write Concern

Write concern allows designers to configure how many nodes the data must be committed to before the write is acknowledged as complete (see the sketch below).

  • Write Concern (WC) 0 = no acknowledgement
  • WC 1 = only the primary needs to acknowledge
  • WC N = N-1 secondary members must also replicate the write to acknowledge
  • WC majority = a majority of the members need to replicate the data before acknowledging the commit
  • WC majority ensures no rollbacks

Figure: The majority write concern in practice.
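A hedged PyMongo sketch of the majority write concern in practice (connection string, database, and collection names are placeholders): the insert is only acknowledged once a majority of replica set members hold the data, which also rules out rollbacks.

```python
from pymongo import MongoClient, WriteConcern

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")  # assumed replica set
users = client.shop.get_collection(
    "users",
    write_concern=WriteConcern(w="majority", wtimeout=5000),  # WC majority, 5 s timeout
)
# Acknowledged only after a majority of members have replicated the write.
users.insert_one({"name": "Alice"})
```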

SLIDE 50

MongoDB Eventual Consistency

  • Applications are allowed to read from

secondary replica set members if they do not prioritize reading the latest data.

  • This can be achieved by specifying a secondary read preference on the query (see the sketch below).

  • Reads from secondaries may return

data that does not reflect the state of the data on the primary
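A hedged PyMongo sketch of opting into these eventually consistent reads by routing queries to secondary members (the URI and names are placeholders); the returned document may lag behind the primary's state.

```python
from pymongo import MongoClient, ReadPreference

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")  # assumed replica set
users = client.shop.get_collection("users", read_preference=ReadPreference.SECONDARY)

# Served by a secondary member; the document may not yet reflect the primary's state.
print(users.find_one({"name": "Alice"}))
```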

SLIDE 51

MongoDB Node Failover

  • If the primary fails, an election occurs and the secondary replicas elect a new primary.
  • The Raft consensus algorithm is used to elect the new primary.
  • The replica with the most up-to-date data that can reach a majority of the nodes gets elected as primary and is then responsible for updating the oplog read by the secondary members.
  • If the old primary recovers from the failover, it becomes a secondary member of the replica set.

SLIDE 52

MongoDB Oplog

  • The oplog has a configurable back-limit history.
  • It dedicates 5% of the available disk space to storing transaction logs.
  • If a secondary member fails to keep up with the primary and the oplog entries the secondary needs in order to recover have already been replaced by newer transactions from the primary, then all the databases, collections, and index directives are copied again from the primary member or another secondary member.
  • The same process is done when a new member joins a replica set.

SLIDE 54

Neo4j : Introduction

  • Neo4j is a reliable, scalable and

high-performing native graph database

  • Its proper ACID characteristics are a foundation of data reliability

  • Neo4j ensures that operations

involving the modification of data happen within a transaction to guarantee consistent data.

SLIDE 55

Neo4j: The Graph NoSQL Database

  • In Neo4j, a graph is defined by nodes and relationships. A node represents an entity (e.g., the entity Person). It can have several node attributes (e.g., the Person with the name "Alice"). Two entities can be linked by a relationship (e.g., the Person with name "Alice" likes the Person with name "Bob"). Relationships can also have properties.
  • Neo4j uses linked lists of fixed-size records on disk. Each property record holds a key/value pair, and each node or relationship references its first property record. Relationships are stored in a doubly linked list, and a node references its first relationship.

SLIDE 56

Neo4j: Schema Design

  • Neo4j is schema-optional. It is not necessary to create indexes and constraints; nodes, relationships, and properties can be created without defining a schema.
  • Labels define domains by grouping nodes into sets. Nodes that have the same label belong to the same set. For example, all nodes representing cars could be labeled with the same label: Car. This allows Neo4j to perform operations only within a specific label, such as finding all cars with a given brand.

SLIDE 57

Neo4j : Causal Consistency

Neo4j’s Causal Clustering provides three main features:

  1. Safety: Core Servers provide a fault-tolerant platform for transaction processing which will remain available while a simple majority of those Core Servers are functioning.
  2. Scale: Read Replicas provide a massively scalable platform for graph queries that enables very large graph workloads to be executed in a widely distributed topology.
  3. Causal consistency: when invoked, a client application is guaranteed to read at least its own writes.

SLIDE 58

Neo4j : Operational Overview

  • From an operational point of view, it is useful to view the cluster as being composed of servers with two different roles: Cores and Read Replicas.
  • The two roles are foundational in any production deployment, but are managed at different scales from one another and undertake different roles in managing the fault tolerance and scalability of the overall cluster.
SLIDE 59

Neo4j: Core Servers

  • The main responsibility of Core Servers is to safeguard data.
  • Raft ensures that the data is safely durable before confirming transaction commit to the end-user application. Once a majority of the Core Servers in a cluster (N/2 + 1) have accepted the transaction, it is safe to acknowledge the commit to the end-user application.
  • A Core Server cluster contains enough nodes to provide sufficient fault tolerance for the specific deployment. This is calculated with the formula M = 2F + 1, where M is the number of Core Servers required to tolerate F faults.
  • If the cluster suffers enough Core failures, it can no longer process writes and will become read-only to preserve safety.

SLIDE 60

Neo4j: Read Replicas

  • The main responsibility of Read Replicas is to scale out graph workloads.
  • Read Replicas act like caches for the graph data that the Core Servers

safeguard and are fully capable of executing arbitrary (read-only) queries and procedures.

  • Read Replicas are asynchronously replicated from Core Servers via

transaction log shipping. They will periodically poll an upstream server for new transactions and have these shipped over.

  • Losing a Read Replica does not impact the cluster’s availability, aside from

the loss of its fraction of graph query throughput.

SLIDE 61

Neo4j: Causal Consistency

  • While the operational mechanics of the cluster are interesting, from an application point of view we typically just want to read from the graph and write to the graph.
  • Causal consistency makes it possible to write to Core Servers (where data is safe) and read those writes from a Read Replica (where graph operations are scaled out).
  • On executing a transaction, the client can ask for a bookmark, which it then presents as a parameter to subsequent transactions (see the sketch below).
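A hedged sketch of the bookmark mechanism with the official Neo4j Python driver (5.x API assumed; URI, credentials, and queries are illustrative): the bookmark returned by the writing session is handed to a later session so that its reads are causally after the write, even if served by a Read Replica.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "secret"))

# Write to the Core Servers and capture the session's bookmark.
with driver.session(database="neo4j") as session:
    session.execute_write(lambda tx: tx.run("CREATE (:Person {name: $name})", name="Alice"))
    bookmarks = session.last_bookmarks()

# A later session seeded with that bookmark will not observe a state older than the
# write, even if the query is routed to an asynchronously replicated Read Replica.
with driver.session(database="neo4j", bookmarks=bookmarks) as session:
    names = session.execute_read(
        lambda tx: [record["p.name"] for record in tx.run("MATCH (p:Person) RETURN p.name")]
    )
    print(names)

driver.close()
```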

SLIDE 62

OrientDB

  • Multi-model NoSQL database
  • The main data models are graphs and documents, but it also supports the key/value model
  • Supports strong, session, and eventual consistency through the choice of client load-balancing configuration
  • Primarily favors the CA (Consistency and Availability) side of the CAP Theorem
SLIDE 63

OrientDB - Replication

  • Originally employed a master-slave strategy, resulting in a not-so-scalable

architecture

  • Swapped to multi-master replication
  • Multi-master replication: data is stored by a group and can be updated by any member of the group
  • The system then propagates the data to the rest of the members and deals with conflicts caused by concurrent changes

SLIDE 64

OrientDB - Sharding

  • Sharding is done at the class level, using multiple clusters per class, where each cluster has its own list of servers to which data is replicated
  • All records stored in a cluster are part of the same class
  • A cluster can also have multiple servers, where the first server is its master
  • The records of each cluster are copied across all of its servers

SLIDE 65

OrientDB - Strong Consistency

  • Default Configuration: Sticky

configuration

  • Client hits the same master over

and over

  • Stays connected to same server

until the DB closes

  • In exchange for high consistency,

it sacrifices performance and has high latency

SLIDE 66

OrientDB - Session Consistency

  • Round Robin Connect

Configuration

  • Client connects to a different

server at each connection following a round robin schedule

  • Obviously strong performance

and availability provided it’s in the same session, but doesn’t have strong consistency with other sessions.

SLIDE 67

OrientDB - Eventual Consistency

  • Round Robin Request

Configuration

  • Client connects to a different

server at each request following a round robin schedule

  • Consistency takes a while, but

it’s low latency

  • Has the same scaling limitations as MongoDB in this configuration

SLIDE 68

OrientDB - Concurrency

  • Uses Optimistic Concurrency Control
  • Used for both Atomic Operations and Transactions
  • Atomic operations use Multi-Version Concurrency Control (MVCC) in OrientDB
  • Transactions don't use locks, but check the record version to see if there have been updates from other clients

SLIDE 69

OrientDB - Multi Version Concurrency Control

  • A conflict occurs when two threads attempt to update the same record
  • Every update increments the version number on the record
  • If the thread updating the record doesn't have the newest version number, the update fails and an exception is returned (see the sketch below)
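To illustrate the version check, here is a generic Python sketch of optimistic concurrency control; it is not the OrientDB client API, just the same idea: every record carries a version, an update must present the version it read, and a stale version raises a conflict the client can retry.

```python
class ConcurrentModification(Exception):
    """Raised when a record was changed by another client since it was read."""

class RecordStore:
    def __init__(self):
        self._records = {}  # rid -> (version, value)

    def create(self, rid, value):
        self._records[rid] = (1, value)

    def read(self, rid):
        return self._records[rid]  # (version, value)

    def update(self, rid, expected_version, new_value):
        version, _ = self._records[rid]
        if version != expected_version:
            # Another client bumped the version first: reject instead of locking.
            raise ConcurrentModification(
                f"record {rid}: expected v{expected_version}, found v{version}"
            )
        self._records[rid] = (version + 1, new_value)

store = RecordStore()
store.create("#10:3", {"balance": 100})
version, doc = store.read("#10:3")
store.update("#10:3", version, {"balance": 90})       # succeeds, version becomes 2
try:
    store.update("#10:3", version, {"balance": 80})   # stale version 1 -> conflict
except ConcurrentModification as err:
    print(err)
```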

SLIDE 70

Summary

We discussed different consistency implementations of several NoSQL databases.

SLIDE 71

Summary

Below is where the default configuration of each database system stands on the CAP theorem:

  • Neo4j and OrientDB favor strong consistency and availability
  • Cassandra favors availability, low latency, and network partition tolerance
  • MongoDB and Redis favor strong consistency and network partition tolerance

SLIDE 72

Conclusion

  • For applications that want high consistency and partition tolerance at the cost of availability (for writes), MongoDB is the best option
  • Applications that favor high availability and low latency over consistency will want Cassandra
  • If you have the option of partition intolerance, both Neo4j and OrientDB can provide high consistency

SLIDE 73

References

Miguel Diogo, Bruno Cabral, and Jorge Bernardino, "Consistency Models of NoSQL Databases," Future Internet 2019, 11, 43.