Making Non-Distributed Databases, Distributed

SLIDE 1

Making Non-Distributed Databases, Distributed
Ioannis Papapanagiotou, PhD
Shailesh Birari

SLIDE 2

Dynomite Ecosystem

  • Dynomite - Proxy layer
  • Dyno - Client
  • Dynomite-manager - Ecosystem orchestrator
  • Dynomite-explorer - UI
SLIDE 3
Problems & Observations

  • Needed a data store that is:
    • Scalable & highly available
    • High throughput, low latency
  • The Netflix use case is active-active
  • Master-slave storage engines:
    • Do not support bi-directional replication
    • Cannot withstand a Chaos Monkey attack
    • Cannot easily undergo maintenance

SLIDE 4

What is Dynomite?

A framework that makes non-distributed data stores distributed. It can be used with many key-value storage engines.

Features: highly available, automatic failover, node warm-up, tunable consistency, backups/restores.

SLIDE 5

Dynomite @ Netflix

  • Running in PROD for around 2.5 years
  • 70 clusters
  • ~1000 nodes used by internal microservices
  • Microservices based on Java, Python, and Node.js

SLIDE 6

Pluggable Storage Engines

  • A layer on top of a non-distributed key-value data store, exposed over RESP (the Redis protocol; see the example below)
    ○ Peer-to-peer, shared-nothing
    ○ Auto-sharding
    ○ Multi-datacenter
    ○ Linear scaling
    ○ Replication
    ○ Gossiping
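
Because Dynomite exposes RESP, an off-the-shelf Redis client can talk to a Dynomite node directly. A minimal sketch, assuming a hypothetical host name (8102 is the client-facing port used in Dynomite's sample configurations):

```java
import redis.clients.jedis.Jedis;

public class RespExample {
    public static void main(String[] args) {
        // Dynomite speaks RESP, so a vanilla Redis client works unchanged.
        // The host name is hypothetical.
        try (Jedis jedis = new Jedis("dynomite-node.example.com", 8102)) {
            jedis.set("greeting", "hello");
            System.out.println(jedis.get("greeting")); // prints "hello"
        }
    }
}
```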

SLIDE 7
Topology

  • Each rack contains one copy of the data, partitioned across multiple nodes in that rack
  • Multiple racks == higher availability (HA)

SLIDE 8

Replication

  • A client can connect to any node of the Dynomite cluster when sending requests.
  • If the node owns the data, the data are written to the local data store and asynchronously replicated.
  • If the node does not own the data, it acts as a coordinator: it sends the data to the owning node in the same rack and replicates them to the nodes in the other racks and DCs.

A simplified sketch of this flow follows.
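
The sketch is illustrative only: Dynomite itself is written in C, and every name below (Node, TokenRing, Coordinator) is a hypothetical stand-in for the flow described above.

```java
import java.util.List;

// Hypothetical stand-ins for a Dynomite node and its token ring.
interface Node {
    // Synchronous write: to the local store, or forwarded within the rack.
    void write(String key, byte[] value);
    // Fire-and-forget replication to another rack or data center.
    void replicateAsync(String key, byte[] value);
}

interface TokenRing {
    Node ownerOf(String key);              // consistent-hash lookup in the local rack
    List<Node> remoteReplicas(String key); // nodes owning the same token elsewhere
}

class Coordinator {
    private final TokenRing ring;

    Coordinator(TokenRing ring) { this.ring = ring; }

    void handleWrite(String key, byte[] value) {
        // If this node owns the key the write is local; otherwise this node
        // coordinates, forwarding to the owning node in the same rack.
        ring.ownerOf(key).write(key, value);

        // Replication to the other racks and DCs is asynchronous.
        for (Node peer : ring.remoteReplicas(key)) {
            peer.replicateAsync(key, value);
        }
    }
}
```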

SLIDE 9

Dyno Client - Java API

  • Connection Pooling
  • Load Balancing
  • Effective failover
  • Pipelining
  • Scatter/Gather
  • Metrics, e.g. Netflix Insights
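
A minimal sketch of building and using a Dyno client, loosely following the Dyno README; the application name, cluster name, host, and rack below are hypothetical, and the exact Host constructor varies between Dyno versions:

```java
import java.util.Collections;

import com.netflix.dyno.connectionpool.Host;
import com.netflix.dyno.connectionpool.HostSupplier;
import com.netflix.dyno.jedis.DynoJedisClient;

public class DynoExample {
    public static void main(String[] args) {
        // Hypothetical host list; at Netflix the hosts would normally come
        // from discovery (Eureka) rather than a hard-coded supplier. The
        // Host constructor shown here may differ across Dyno versions.
        HostSupplier hostSupplier = () -> Collections.singletonList(
                new Host("dynomite-node.example.com", 8102, "us-east-1a", Host.Status.Up));

        DynoJedisClient client = new DynoJedisClient.Builder()
                .withApplicationName("myApp")          // hypothetical
                .withDynomiteClusterName("myCluster")  // hypothetical
                .withHostSupplier(hostSupplier)
                .build();

        client.set("greeting", "hello");        // routed to the token-owning node
        System.out.println(client.get("greeting"));
    }
}
```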
SLIDE 10

Dyno Load Balancing

  • The Dyno client employs token-aware load balancing.
  • The Dyno client is aware of Dynomite's cluster topology within the region and can write to a specific node using consistent hashing, as in the sketch below.
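
An illustrative sketch of the idea behind token-aware routing; this is not Dyno's actual implementation (Dynomite's hash function is configurable, Murmur being a common choice):

```java
import java.util.Map;
import java.util.TreeMap;

// Each node owns a token on a ring; a key's hash is routed to the first
// node whose token is >= the hash, wrapping around to the start of the ring.
public class TokenAwareRouter {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public void addNode(long token, String node) {
        ring.put(token, node);
    }

    public String nodeFor(String key) {
        long hash = hash(key);
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash);
        return entry != null ? entry.getValue() : ring.firstEntry().getValue();
    }

    // Placeholder hash for the sketch; Dynomite's hash is configurable.
    private long hash(String key) {
        return Integer.toUnsignedLong(key.hashCode());
    }
}
```

Because each rack holds one full copy of the data (slide 7), the client can keep one such ring per rack and fail over between racks without losing coverage of the token space.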

SLIDE 11

Dyno Failover

  • Dyno will route requests to different racks in failure scenarios.

SLIDE 12

Dynomite on the Cloud


SLIDE 13

Moving across engines

[Diagram: moving data between storage engines across Rack A and Rack B]

SLIDE 14

Dynomite-manager: Warm up

  1. Dynomite-manager identifies which node has the same token in the same DC
  2. Leverages master/slave replication, as sketched below
  3. Checks for peer syncing
     a. the difference between the master and slave offsets
  4. Once master and slave are in sync, Dynomite is set to allow writes only
  5. Dynomite is set back to the normal state
  6. Checks the health of the node - done!
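
A condensed sketch of steps 2-5, assuming Redis as the storage engine; setDynomiteState() below is a hypothetical stand-in for the state change that Dynomite-manager actually drives:

```java
import redis.clients.jedis.Jedis;

public class WarmUpSketch {

    static void warmUp(Jedis localRedis, String peerHost, int peerPort)
            throws InterruptedException {
        // 2. Replicate from the peer that owns the same token in this DC.
        localRedis.slaveof(peerHost, peerPort);

        // 3. Poll until the master and slave replication offsets converge.
        while (replicationLag(localRedis) > 0) {
            Thread.sleep(1_000);
        }

        // 4. Allow writes only while the node detaches from its peer...
        setDynomiteState("writes_only");   // hypothetical manager call
        localRedis.slaveofNoOne();         // stop replicating; act standalone

        // 5. ...then set Dynomite back to its normal state.
        setDynomiteState("normal");
    }

    // 3a. Difference between the master and slave replication offsets,
    // parsed from Redis's "INFO replication" output.
    static long replicationLag(Jedis redis) {
        String info = redis.info("replication");
        return parseField(info, "master_repl_offset")
                - parseField(info, "slave_repl_offset");
    }

    static long parseField(String info, String field) {
        for (String line : info.split("\r\n")) {
            if (line.startsWith(field + ":")) {
                return Long.parseLong(line.substring(field.length() + 1).trim());
            }
        }
        return -1; // field not reported yet (e.g. replication not started)
    }

    static void setDynomiteState(String state) {
        // Hypothetical: in practice Dynomite-manager performs this transition.
        System.out.println("dynomite state -> " + state);
    }
}
```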
SLIDE 15

Dynomite-Explorer (UI)

  • Node.js web app with a Polymer-based user interface
  • Supports Redis' rich data types
  • Avoids operations that can negatively impact Redis server performance
  • Extended for Dynomite awareness
  • Allows extending the server to integrate with the Netflix ecosystem
SLIDE 16

Dynomite-Explorer

SLIDE 17

Roadmap

  • Data reconciliation & repair v2
  • Optimizations of RocksDB configuration
  • Optimizing backups through SST
  • Others….
SLIDE 18

More information

  • Netflix OSS:
    • https://github.com/Netflix/dynomite
    • https://github.com/Netflix/dyno
    • https://github.com/Netflix/dynomite-manager
  • Chat: https://gitter.im/Netflix/dynomite
SLIDE 19

SLIDE 20

SLIDE 21

Dynomite: S3 backups/restores

  • Why?
    • Disaster recovery
    • Data corruption
  • How?
    • The storage engine dumps its data on the instance drive
    • Dynomite-manager sends the data to S3 buckets (see the sketch below)
    • Data per node are not large, so there is no need for incremental backups
  • Use case:
    • Clusters that use Dynomite as a storage layer
    • Not enabled on clusters that have short TTLs or use Dynomite as a cache
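
A minimal sketch of the backup path, assuming Redis as the storage engine and the AWS SDK for Java v1; the dump path, bucket, and object key are all hypothetical:

```java
import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class S3BackupSketch {
    public static void main(String[] args) {
        // The storage engine has already dumped its data to the instance
        // drive (for Redis, an RDB file produced by BGSAVE).
        File dump = new File("/mnt/data/redis/dump.rdb");  // hypothetical path

        // Dynomite-manager then ships the dump to S3, one object per node.
        // Per-node data are small, so full (non-incremental) backups suffice.
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        s3.putObject("my-dynomite-backups",                  // hypothetical bucket
                     "my-cluster/node-token-12345/dump.rdb", // hypothetical key
                     dump);
    }
}
```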

SLIDE 22

Dynomite-manager

  • Token management for multi-region deployments
  • Supports the AWS environment
  • Automated security group updates in multi-region environments
  • Monitoring of Dynomite and the underlying storage engine
  • Node cold bootstrap (warm up)
  • S3 backups and restores
  • REST API
SLIDE 23

Performance Setup

  • Instance type:
    ○ Dynomite: i3.2xlarge with NVMe
    ○ NDBench: m2.2xl instances (typical of an app @ Netflix)
  • Replication factor: 3
    ○ Dynomite deployed in 3 zones in us-east-1
    ○ Every zone had the same number of servers
  • Demo app used simple key/value workloads
    ○ Redis: GET and SET
  • Payload
    ○ Size: 1024 bytes
    ○ 80%/20% reads over writes

SLIDE 24

Throughput

SLIDE 25

Latencies

SLIDE 26

Consistency

  • DC_ONE
    • Reads and writes are propagated synchronously only to the node in the local rack and asynchronously replicated to the other racks and data centers.
  • DC_QUORUM
    • Reads and writes are propagated synchronously to a quorum of nodes in the local data center and asynchronously to the rest. The DC_QUORUM configuration writes to the number of nodes that make up a quorum. The quorum is calculated and then rounded down to a whole number (see the worked example below). If all responses differ, the first response the coordinator received is returned.
  • DC_SAFE_QUORUM
    • Similar to DC_QUORUM, but the operation succeeds only if the read/write succeeded on a quorum of nodes and the data checksums match. If the quorum has not been achieved, Dynomite generates an error response.
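
A worked example of the quorum arithmetic, assuming the usual majority quorum of half the replicas plus one (the division is what gets rounded down to a whole number), with the replication factor of 3 from the performance setup:

```java
public class QuorumExample {
    public static void main(String[] args) {
        // With one copy of the data per rack and three racks in the local DC:
        int replicationFactor = 3;
        int quorum = (replicationFactor / 2) + 1; // 3 / 2 = 1 (rounded down), + 1 = 2

        // A DC_QUORUM read or write must succeed synchronously on 2 of the
        // 3 local-DC replicas before the coordinator acknowledges it.
        System.out.println("quorum = " + quorum); // quorum = 2
    }
}
```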

SLIDE 27

Deploying Dynomite in PROD

  • Unit testing on GitHub
  • Building an EC2 AMI in “experimental”
  • Pipelines for performance analysis
  • Promotion to “candidate”
  • Beta testing
  • Promotion to “release”
SLIDE 28

Reconciliation

  • Reconciliation is based on timestamps (newest wins) and is performed by a Spark cluster, as in the sketch below
  • A Jenkins job guards against clock skew
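
A minimal newest-wins sketch using Spark's Java API; this is illustrative only, the actual job is not shown in this deck, and the keys, values, and timestamps below are made up:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class NewestWinsSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("dynomite-reconciliation")
                .setMaster("local[*]"); // local run, for illustration only

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Made-up sample: the same key observed in two replicas with
            // different write timestamps; value = (payload, timestampMillis).
            JavaPairRDD<String, Tuple2<String, Long>> records = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>("user:42", new Tuple2<>("v1", 1_000L)),    // older copy
                    new Tuple2<>("user:42", new Tuple2<>("v2", 2_000L))));  // newer copy

            // Newest wins: per key, keep the record with the largest timestamp.
            JavaPairRDD<String, Tuple2<String, Long>> reconciled =
                    records.reduceByKey((a, b) -> a._2() >= b._2() ? a : b);

            reconciled.collect().forEach(kv ->
                    System.out.println(kv._1() + " -> " + kv._2()._1())); // user:42 -> v2
        }
    }
}
```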
SLIDE 29

Reconciliation: Design Principles

We prefer to off-load the processing cost of reconciliation from each node in the cluster to a high-performance, in-memory compute cluster based on Spark.

SLIDE 30

Reconciliation: Architecture

  • Forcing Redis (or any other storage engine) to dump data to the disk
  • Encrypted communication between Dynomite and the Spark cluster
  • Chunking the data, with retries in case of failure
  • Bandwidth throttler