Making Non-Distributed Databases, Distributed

SLIDE 1

Making Non-Distributed Databases, Distributed
Ioannis Papapanagiotou, PhD
Shailesh Birari

SLIDE 2

Dynomite Ecosystem

  • Dynomite - Proxy layer
  • Dyno - Client
  • Dynomite-manager - Ecosystem orchestrator
  • Dynomite-explorer - UI
SLIDE 3
Problems & Observations

  • Needed a data store that is:
    • Scalable & highly available
    • High throughput, low latency
  • The Netflix use case is active-active
  • Master-slave storage engines:
    • Do not support bi-directional replication
    • Cannot withstand a Chaos Monkey attack
    • Cannot easily undergo maintenance

SLIDE 4

What is Dynomite?

A framework that makes non-distributed data stores distributed. It can be used with many key-value storage engines.

Features: highly available, automatic failover, node warm-up, tunable consistency, backups/restores.

SLIDE 5

Dynomite @ Netflix

  • Running in PROD for around 2.5 years
  • 70 clusters
  • ~1000 nodes used by internal microservices
  • Microservices based on Java, Python, and Node.js

SLIDE 6

Pluggable Storage Engines

  • A layer on top of a non-distributed key-value data store, exposed over RESP (the Redis protocol; see the example below)
    ○ Peer-to-peer, shared-nothing
    ○ Auto-sharding
    ○ Multi-datacenter
    ○ Linear scaling
    ○ Replication
    ○ Gossiping
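
Because Dynomite exposes RESP, an off-the-shelf Redis client can talk to a Dynomite node directly. A minimal sketch, assuming a hypothetical host name (8102 is the client-facing port used in Dynomite's sample configurations):

```java
import redis.clients.jedis.Jedis;

public class RespExample {
    public static void main(String[] args) {
        // Dynomite speaks RESP, so a vanilla Redis client works unchanged.
        // The host name is hypothetical.
        try (Jedis jedis = new Jedis("dynomite-node.example.com", 8102)) {
            jedis.set("greeting", "hello");
            System.out.println(jedis.get("greeting")); // prints "hello"
        }
    }
}
```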

SLIDE 7
Topology

  • Each rack contains one copy of the data, partitioned across multiple nodes in that rack
  • Multiple racks == higher availability (HA)

SLIDE 8

Replication

  • A client can connect to any node of the Dynomite cluster when sending requests.
  • If the node owns the data, the data are written to the local data store and asynchronously replicated.
  • If the node does not own the data, it acts as a coordinator: it sends the data to the owning node in the same rack and replicates them to the nodes in the other racks and DCs.

A simplified sketch of this flow follows.
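
The sketch is illustrative only: Dynomite itself is written in C, and every name below (Node, TokenRing, Coordinator) is a hypothetical stand-in for the flow described above.

```java
import java.util.List;

// Hypothetical stand-ins for a Dynomite node and its token ring.
interface Node {
    // Synchronous write: to the local store, or forwarded within the rack.
    void write(String key, byte[] value);
    // Fire-and-forget replication to another rack or data center.
    void replicateAsync(String key, byte[] value);
}

interface TokenRing {
    Node ownerOf(String key);              // consistent-hash lookup in the local rack
    List<Node> remoteReplicas(String key); // nodes owning the same token elsewhere
}

class Coordinator {
    private final TokenRing ring;

    Coordinator(TokenRing ring) { this.ring = ring; }

    void handleWrite(String key, byte[] value) {
        // If this node owns the key the write is local; otherwise this node
        // coordinates, forwarding to the owning node in the same rack.
        ring.ownerOf(key).write(key, value);

        // Replication to the other racks and DCs is asynchronous.
        for (Node peer : ring.remoteReplicas(key)) {
            peer.replicateAsync(key, value);
        }
    }
}
```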

SLIDE 9

Dyno Client - Java API

  • Connection Pooling
  • Load Balancing
  • Effective failover
  • Pipelining
  • Scatter/Gather
  • Metrics, e.g. Netflix Insights
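
A minimal sketch of building and using a Dyno client, loosely following the Dyno README; the application name, cluster name, host, and rack below are hypothetical, and the exact Host constructor varies between Dyno versions:

```java
import java.util.Collections;

import com.netflix.dyno.connectionpool.Host;
import com.netflix.dyno.connectionpool.HostSupplier;
import com.netflix.dyno.jedis.DynoJedisClient;

public class DynoExample {
    public static void main(String[] args) {
        // Hypothetical host list; at Netflix the hosts would normally come
        // from discovery (Eureka) rather than a hard-coded supplier. The
        // Host constructor shown here may differ across Dyno versions.
        HostSupplier hostSupplier = () -> Collections.singletonList(
                new Host("dynomite-node.example.com", 8102, "us-east-1a", Host.Status.Up));

        DynoJedisClient client = new DynoJedisClient.Builder()
                .withApplicationName("myApp")          // hypothetical
                .withDynomiteClusterName("myCluster")  // hypothetical
                .withHostSupplier(hostSupplier)
                .build();

        client.set("greeting", "hello");        // routed to the token-owning node
        System.out.println(client.get("greeting"));
    }
}
```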
SLIDE 10

Dyno Load Balancing

  • The Dyno client employs token-aware load balancing.
  • The Dyno client is aware of Dynomite's cluster topology within the region and can write to a specific node using consistent hashing, as in the sketch below.
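
An illustrative sketch of the idea behind token-aware routing; this is not Dyno's actual implementation (Dynomite's hash function is configurable, Murmur being a common choice):

```java
import java.util.Map;
import java.util.TreeMap;

// Each node owns a token on a ring; a key's hash is routed to the first
// node whose token is >= the hash, wrapping around to the start of the ring.
public class TokenAwareRouter {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public void addNode(long token, String node) {
        ring.put(token, node);
    }

    public String nodeFor(String key) {
        long hash = hash(key);
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash);
        return entry != null ? entry.getValue() : ring.firstEntry().getValue();
    }

    // Placeholder hash for the sketch; Dynomite's hash is configurable.
    private long hash(String key) {
        return Integer.toUnsignedLong(key.hashCode());
    }
}
```

Because each rack holds one full copy of the data (slide 7), the client can keep one such ring per rack and fail over between racks without losing coverage of the token space.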

SLIDE 11

Dyno Failover

  • Dyno will route requests to different racks in failure scenarios.

SLIDE 12

Dynomite on the Cloud


SLIDE 13

Moving across engines

[Diagram: moving data between storage engines across Rack A and Rack B]

SLIDE 14

Dynomite-manager: Warm up

  1. Dynomite-manager identifies which node has the same token in the same DC
  2. Leverages master/slave replication, as sketched below
  3. Checks for peer syncing
     a. the difference between the master and slave offsets
  4. Once master and slave are in sync, Dynomite is set to allow writes only
  5. Dynomite is set back to the normal state
  6. Checks the health of the node - done!
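
A condensed sketch of steps 2-5, assuming Redis as the storage engine; setDynomiteState() below is a hypothetical stand-in for the state change that Dynomite-manager actually drives:

```java
import redis.clients.jedis.Jedis;

public class WarmUpSketch {

    static void warmUp(Jedis localRedis, String peerHost, int peerPort)
            throws InterruptedException {
        // 2. Replicate from the peer that owns the same token in this DC.
        localRedis.slaveof(peerHost, peerPort);

        // 3. Poll until the master and slave replication offsets converge.
        while (replicationLag(localRedis) > 0) {
            Thread.sleep(1_000);
        }

        // 4. Allow writes only while the node detaches from its peer...
        setDynomiteState("writes_only");   // hypothetical manager call
        localRedis.slaveofNoOne();         // stop replicating; act standalone

        // 5. ...then set Dynomite back to its normal state.
        setDynomiteState("normal");
    }

    // 3a. Difference between the master and slave replication offsets,
    // parsed from Redis's "INFO replication" output.
    static long replicationLag(Jedis redis) {
        String info = redis.info("replication");
        return parseField(info, "master_repl_offset")
                - parseField(info, "slave_repl_offset");
    }

    static long parseField(String info, String field) {
        for (String line : info.split("\r\n")) {
            if (line.startsWith(field + ":")) {
                return Long.parseLong(line.substring(field.length() + 1).trim());
            }
        }
        return -1; // field not reported yet (e.g. replication not started)
    }

    static void setDynomiteState(String state) {
        // Hypothetical: in practice Dynomite-manager performs this transition.
        System.out.println("dynomite state -> " + state);
    }
}
```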
SLIDE 15

Dynomite-Explorer (UI)

  • Node.js web app with a Polymer-based user interface
  • Supports Redis' rich data types
  • Avoids operations that can negatively impact Redis server performance
  • Extended for Dynomite awareness
  • Allows extending the server to integrate with the Netflix ecosystem
SLIDE 16

Dynomite-Explorer

SLIDE 17

Roadmap

  • Data reconciliation & repair v2
  • Optimizations of RocksDB configuration
  • Optimizing backups through SST
  • Others….
SLIDE 18

More information

  • Netflix OSS:
    • https://github.com/Netflix/dynomite
    • https://github.com/Netflix/dyno
    • https://github.com/Netflix/dynomite-manager
  • Chat: https://gitter.im/Netflix/dynomite
SLIDE 19

SLIDE 20

SLIDE 21

Dynomite: S3 backups/restores

  • Why?
    • Disaster recovery
    • Data corruption
  • How?
    • The storage engine dumps its data on the instance drive
    • Dynomite-manager sends the data to S3 buckets (see the sketch below)
    • Data per node are not large, so there is no need for incremental backups
  • Use case:
    • Clusters that use Dynomite as a storage layer
    • Not enabled on clusters that have short TTLs or use Dynomite as a cache
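
A minimal sketch of the backup path, assuming Redis as the storage engine and the AWS SDK for Java v1; the dump path, bucket, and object key are all hypothetical:

```java
import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class S3BackupSketch {
    public static void main(String[] args) {
        // The storage engine has already dumped its data to the instance
        // drive (for Redis, an RDB file produced by BGSAVE).
        File dump = new File("/mnt/data/redis/dump.rdb");  // hypothetical path

        // Dynomite-manager then ships the dump to S3, one object per node.
        // Per-node data are small, so full (non-incremental) backups suffice.
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        s3.putObject("my-dynomite-backups",                  // hypothetical bucket
                     "my-cluster/node-token-12345/dump.rdb", // hypothetical key
                     dump);
    }
}
```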

SLIDE 22

Dynomite-manager

  • Token management for multi-region deployments
  • Supports the AWS environment
  • Automated security group updates in multi-region environments
  • Monitoring of Dynomite and the underlying storage engine
  • Node cold bootstrap (warm up)
  • S3 backups and restores
  • REST API
SLIDE 23

Performance Setup

  • Instance type:
    ○ Dynomite: i3.2xlarge with NVMe
    ○ NDBench: m2.2xl instances (typical of an app @ Netflix)
  • Replication factor: 3
    ○ Dynomite deployed in 3 zones in us-east-1
    ○ Every zone had the same number of servers
  • Demo app used simple key/value workloads
    ○ Redis: GET and SET
  • Payload
    ○ Size: 1024 bytes
    ○ 80%/20% reads over writes

SLIDE 24

Throughput

SLIDE 25

Latencies

SLIDE 26

Consistency

  • DC_ONE
    • Reads and writes are propagated synchronously only to the node in the local rack and asynchronously replicated to the other racks and data centers.
  • DC_QUORUM
    • Reads and writes are propagated synchronously to a quorum of nodes in the local data center and asynchronously to the rest. The DC_QUORUM configuration writes to the number of nodes that make up a quorum. The quorum is calculated and then rounded down to a whole number (see the worked example below). If all responses differ, the first response the coordinator received is returned.
  • DC_SAFE_QUORUM
    • Similar to DC_QUORUM, but the operation succeeds only if the read/write succeeded on a quorum of nodes and the data checksums match. If the quorum has not been achieved, Dynomite generates an error response.
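
A worked example of the quorum arithmetic, assuming the usual majority quorum of half the replicas plus one (the division is what gets rounded down to a whole number), with the replication factor of 3 from the performance setup:

```java
public class QuorumExample {
    public static void main(String[] args) {
        // With one copy of the data per rack and three racks in the local DC:
        int replicationFactor = 3;
        int quorum = (replicationFactor / 2) + 1; // 3 / 2 = 1 (rounded down), + 1 = 2

        // A DC_QUORUM read or write must succeed synchronously on 2 of the
        // 3 local-DC replicas before the coordinator acknowledges it.
        System.out.println("quorum = " + quorum); // quorum = 2
    }
}
```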

SLIDE 27

Deploying Dynomite in PROD

  • Unit testing on GitHub
  • Building an EC2 AMI in “experimental”
  • Pipelines for performance analysis
  • Promotion to “candidate”
  • Beta testing
  • Promotion to “release”
SLIDE 28

Reconciliation

  • Reconciliation is based on timestamps (newest wins) and is performed by a Spark cluster, as in the sketch below
  • A Jenkins job guards against clock skew
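
A minimal newest-wins sketch using Spark's Java API; this is illustrative only, the actual job is not shown in this deck, and the keys, values, and timestamps below are made up:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class NewestWinsSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("dynomite-reconciliation")
                .setMaster("local[*]"); // local run, for illustration only

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Made-up sample: the same key observed in two replicas with
            // different write timestamps; value = (payload, timestampMillis).
            JavaPairRDD<String, Tuple2<String, Long>> records = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>("user:42", new Tuple2<>("v1", 1_000L)),    // older copy
                    new Tuple2<>("user:42", new Tuple2<>("v2", 2_000L))));  // newer copy

            // Newest wins: per key, keep the record with the largest timestamp.
            JavaPairRDD<String, Tuple2<String, Long>> reconciled =
                    records.reduceByKey((a, b) -> a._2() >= b._2() ? a : b);

            reconciled.collect().forEach(kv ->
                    System.out.println(kv._1() + " -> " + kv._2()._1())); // user:42 -> v2
        }
    }
}
```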
SLIDE 29

Reconciliation: Design Principles

We prefer to off-load the processing cost of reconciliation from each node in the cluster to a high-performance, in-memory compute cluster based on Spark.

SLIDE 30

Reconciliation: Architecture

  • Forcing Redis (or any other storage engine) to dump data to the disk
  • Encrypted communication between Dynomite and the Spark cluster
  • Chunking the data, with retries in case of failure
  • Bandwidth throttler